COMPARISON OF NATURAL AND GLOTTAL
AREA WAVEFORM SYNTHETIC SPEECH


By 


MARIANO NADAL SURIS


A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL OF
THE UNIVERSITY OF FLORIDA
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE
DEGREE OF DOCTOR OF PHILOSOPHY


UNIVERSITY  OF  FLORIDA 


197S 


Kakta  Gib-to 


ACKNOWLEDGMENTS 


I wish to acknowledge those who made possible the
completion of this work.

First, I want to express gratitude to my advisor,
Dr. Donald G. Childers, whose extraordinary devotion,
professional competence, and sincere encouragement have
made my task, in my estimation, most fruitful. I take
pride in bearing testimony that my association with him
will provide a lifelong influence and inspiration.

I also wish to acknowledge the invaluable assistance

Moore, Arnold Paige, Leon B. Couch, and Nathan W. Perry.

In addition, I am grateful to others who assisted
me. Thanks are due to Mrs. Roswitha Zamorano, for her
untiring effort on my behalf. Her cooperation, in typing
the manuscript, made possible the completion of the work
on time. Appreciation is expressed to Dr. Beverly H.
Hildebrand, for her time and labor spent on the Graf/pen,
and to Dr. William A. Yost, for his guidance and aid in
the procedural setup of the listener evaluation session.
Recognition is due to the University of Puerto Rico,

financial support. In particular, thanks are due to Prof.
Rafael Pietri Oms, Chancellor, for his continuous
encouragement, and for the faith that he has placed in me.

Finally, I wish to express my indebtedness to my
parents, relatives, and friends, especially to Dr. Teodoro
Mercado Jimenez, my former teacher and present colleague,
whose encouragement and help have undoubtedly stimulated

This research was supported in part by grant NS 11790
from the National Institute of Neurological and
Communicative Disorders and Stroke of the National
Institutes of Health.


CONTENTS 




APPENDIX 


SPEECH SYNTHESIS BASED ON VOCAL CORD MODELING
METHOD OF LINEAR PREDICTION IN SPEECH SYNTHESIS


Example  

LISTENER EVALUATION FORM
DATA  FOR  SUBJECTS  MB  AND  JB 


BIOGRAPHICAL  SKETCH 


Abstract of Dissertation Presented to the Graduate Council of
the University of Florida in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy


COMPARISON OF NATURAL AND GLOTTAL
AREA WAVEFORM SYNTHETIC SPEECH

By

Mariano Nadal Suris

Chairman: Donald G. Childers

Major Department: Electrical Engineering

The objective of this research was to investigate the
role of the vocal cord vibratory pattern in the production
of natural voiced speech. The naturalness of synthetic
speech, derived from glottal area waveform excitation, was
evaluated and compared to real speech by conducting a formal
listener session. Three tasks were established to determine
whether or not the real and synthetic speech were
indistinguishable, and to judge the quality of the synthetic
speech.

The glottal area waveform, obtained from a pathological
subject, was also used to synthesize speech. The results are
presented in this study.

The glottal area was measured from high-speed films
obtained by photographing the vocal cords at an average speed
of 4,400 frames/s. The measurement of the glottal area was
semiautomated, and the equipment included a Nova 2
minicomputer, a Tektronix 4010 CRT display terminal, a
Graf/pen, and supporting software.

A semiautomated software system was designed to
synthesize speech from the glottal area waveform.


The vocal tract was modeled by a digital filter whose
parameters were found by the method of linear prediction.
The steps to derive the model were as follows. The sound
phonated during the filming session was recorded on magnetic
tape, sampled at 10 kHz, and stored on computer disk. A
record of three pitch periods of speech was then used to
determine the linear prediction all-pole filter model. The
formants were extracted from the linear prediction model by
factoring it and retaining the formant poles. Formant
bandwidths were changed according to the current literature,
since it is believed that the linear prediction method does
not give correct bandwidth values. The factored filter was
then remultiplied and the resultant expression employed as
the vocal tract model.


Synthetic speech was obtained by parabolically
interpolating the glottal area measured from high-speed films
to an effective 10 kHz; periodically extending one glottal
pitch period; incorporating the jitter and shimmer parameters
derived from the original speech into the resultant glottal
area waveform; and exciting the vocal tract model with this
function. The effect of radiation at the mouth was also
considered.
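
The last two steps above (exciting an all-pole vocal tract
model and accounting for radiation at the mouth) can be
sketched in Python. This is only an illustration under stated
assumptions: the function names, the pitch-period samples, the
second-order filter coefficients and the radiation constant
alpha are hypothetical and are not the values used in this
research. Jitter and shimmer are omitted for brevity.

```python
def all_pole_filter(excitation, a):
    """Run y[n] = x[n] - sum_{k>=1} a[k] * y[n-k], with a[0] == 1."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def radiate(signal, alpha):
    """Apply the factor (1 - alpha z^-1): r[n] = s[n] - alpha * s[n-1]."""
    return [s - alpha * (signal[n - 1] if n > 0 else 0.0)
            for n, s in enumerate(signal)]


# Hypothetical single pitch period of glottal area (arbitrary units)
one_period = [0.0, 0.0, 1.0, 3.0, 4.0, 3.0, 1.0, 0.0]

# Periodic extension of the single period
excitation = one_period * 50

# Hypothetical stable second-order all-pole model (pole radius^2 = 0.8)
denominator = [1.0, -1.2, 0.8]

speech = radiate(all_pole_filter(excitation, denominator), alpha=0.9)
```

A real vocal tract model would use the higher-order denominator
obtained from the retained formant poles; the second-order
filter here only keeps the example short.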

Three glottal area waveforms, obtained from normal male
subjects, were used to synthesize voiced speech. For the
listener evaluation, a magnetic tape was prepared containing
one-second speech segments of both the synthetic speech and
the real speech obtained during the filming session. The
listener evaluation tasks showed that the synthetic speech
differed substantially from its natural counterpart, that is,
they were distinguishable. Furthermore, the judges
demonstrated a strong preference for the real speech by
indicating that it sounded more natural than the synthetic
speech. Based on these findings and on current research being
done on vocal cord modeling, we concluded that the area
waveform was not adequate for a complete description of voiced
speech. The reasons for this are discussed at length.

Results from the pathological case indicated that jitter
and shimmer, when introduced into the glottal waveform,
accounted for the harsh sound quality present in the real
speech.

In general, considerable trial and error work had to be
done in the analysis of the pathological case due to the lack
of available information concerning certain speech p

developed  in  the  scientific  community  iti  the  a 


C interest 
nd  medical 


communities took an active interest in this field. On the
technological side, engineers looked for more efficient ways
to transmit the speech signal at lower bit rates while
maintaining its intelligibility and, whenever possible, its
naturalness. On the medical side, physiologists began to
investigate how our vocal apparatus worked in the hope that
Recently, we have witnessed a combination of efforts
in speech analysis and synthesis on the part of both of
these scientific communities. There are many reasons for
this. Most speech synthesizers are modeled after our vocal
mechanism, thus requiring a knowledge of its physiology.
It is exceedingly difficult to study the human vocal
mechanism due to its inaccessibility; thus physiologists are
in need of models that can simulate vocal behavior to aid
their study of our vocal mechanism. Speech pathologists are
in need of automated procedures to process speech data which,
without such mechanical assistance, requires large amounts
of human time and effort. This data processing can be done
effectively and quickly with the aid of high-speed digital
computers. A flow diagram showing the interests common to
both the engineering and medical fields in speech-related
mechanism plays an important role in most speech synthesis
systems. Relatively little is known about our vocal
mechanism because the problems involved in studying its
behavior are multiple. Hence, many models simulate the
vocal mechanism's overall characteristics without taking
into account its detailed structure. This leads to
intelligible synthetic speech which lacks the natural
quality normally found in human speech. It thus seems
advantageous to study in more depth those parameters of the
vocal apparatus that contribute to the naturalness of speech.

Figure 1.2 illustrates a simplified diagram of the
human speech production mechanism. The vocal cords can

speech. The orifice between the vocal cords is called the
glottis, and as the cords open and close, that is, vibrate,
they describe an area. The volume velocity of air (rate of
air flow) that passes through the glottis is modified by the
pharynx, mouth and nasal cavities to produce speech. If
the speech is produced as a result of the vibratory action
of the vocal cords it is voiced; otherwise, it is unvoiced.
Thus, all voiced speech sounds originate at the vocal cords.


tract filter model to synthesize speech. The area waveform,
obtained from a pathological case, was also used to
synthesize speech.

Figure 1.3 summarizes the steps that were followed to

Summary of Results

The results obtained were dependent upon a number of
techniques developed in an attempt to optimize the quality
of the synthetic speech. These techniques are also mentioned
here so that the reader can gain greater insight into the
relevance of the results obtained.

It was desirable to obtain synthetic speech at a sampling
rate of 10 kHz since it is well known that the bandwidth of
speech is of the order of 5 kHz [1]. The glottal area
waveform, measured from high-speed films of the vocal cords,
was sampled by the camera at approximately 4,400 frames/s. A
parabolic interpolation scheme was implemented, among others,
to artificially increase the sampling rate of the area
waveform to 10 kHz. Results showed that the quality of the
synthetic speech was affected by the interpolation of the
glottal area waveform, since every frequency component in the
synthetic speech above 2.2 kHz (half the camera frame rate)
was the result of the interpolation process.

For  the  synthesis,  only  a single  pitch  period  of  the 


glottal  waveform 


The subjects, while being filmed, were instructed to
attempt to phonate an [i] sound. This had the effect of
lifting the epiglottis, which normally blocks the anterior
portion of the vocal cords from view. However, the sound
actually emitted was an [æ] sound. It was difficult to
obtain any other voiced sound because the filming process
required the insertion of a laryngeal mirror into the
subject's mouth. The subjects were thus limited to the
production of simple voiced sounds or phonations.

The vocal tract filter model also had to undergo some
modifications. The method of linear prediction, when applied
to the speech signal, models the vocal tract, the effects of
glottal excitation, and the effects of radiation at the
mouth and nostrils. The resultant model is an all-pole
filter. For speech sampled at 10 kHz the order of the
filter is usually between 13 and 16. In the range of 5 kHz
the vocal tract filter model should have 4 or 5 complex
conjugate pole pairs. The other poles in the linear
prediction model account for the overall effects of radiation
and glottal excitation. The linear prediction filter model
was then factored and only the poles describing the vocal
tract were kept. The formant bandwidths were then altered to

The speech obtained, after exciting the

multiplied (in the z-domain) by the factor (1 - a z^-1) to
account for radiation at the mouth. Experimental results
showed that the optimum synthetic speech was produced for


Figure 1.4 shows the sequence of the different
techniques developed in this research.
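
The step of keeping the vocal tract poles and altering the
formant bandwidths can be sketched in Python. The mapping
between a z-plane pole and its centre frequency and bandwidth
is the standard one for a digital resonator, which is assumed
here rather than taken from this text. The function names, the
example formant frequencies and both sets of bandwidths are
hypothetical and are not data from this study.

```python
import cmath
import math

FS = 10_000.0  # speech sampling rate in Hz


def pole_to_formant(pole, fs=FS):
    """Return (frequency_hz, bandwidth_hz) for an upper-half-plane pole."""
    freq = cmath.phase(pole) * fs / (2.0 * math.pi)
    bandwidth = -math.log(abs(pole)) * fs / math.pi
    return freq, bandwidth


def formant_to_pole(freq, bandwidth, fs=FS):
    """Place a pole with the given centre frequency and bandwidth."""
    radius = math.exp(-math.pi * bandwidth / fs)
    return cmath.rect(radius, 2.0 * math.pi * freq / fs)


def denominator_from_poles(poles):
    """Remultiply conjugate pole pairs into real polynomial coefficients."""
    coeffs = [1.0]
    for p in poles:
        # (1 - p z^-1)(1 - conj(p) z^-1) = 1 - 2 Re(p) z^-1 + |p|^2 z^-2
        section = [1.0, -2.0 * p.real, abs(p) ** 2]
        product = [0.0] * (len(coeffs) + 2)
        for i, c in enumerate(coeffs):
            for j, s in enumerate(section):
                product[i + j] += c * s
        coeffs = product
    return coeffs


# Hypothetical poles kept after factoring: (frequency, bandwidth) in Hz
kept = [formant_to_pole(f, b)
        for f, b in [(660.0, 20.0), (1720.0, 35.0), (2410.0, 50.0)]]

# Hypothetical replacement bandwidths, keeping each formant frequency
new_bandwidths = [60.0, 90.0, 120.0]
adjusted = [formant_to_pole(pole_to_formant(p)[0], bw)
            for p, bw in zip(kept, new_bandwidths)]

vocal_tract_denominator = denominator_from_poles(adjusted)
```

Widening a bandwidth moves the pole toward the origin without
changing its angle, so the formant frequency is preserved.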

Three glottal area waveforms, obtained from normal
male subjects, were selected to synthesize voiced speech.
Three tasks were developed for the listener evaluation of
the synthetic speech. The different tasks permitted the
listeners to compare the real speech to its synthetic
counterpart, and to judge the quality of the synthetic
speech in terms of its naturalness. The first task
consisted of segments of both natural and unnatural speech.
The listener's task was to decide, after hearing each segment,

listeners were presented with pairs of speech segments; each
pair contained a natural segment along with its synthetic
counterpart. The listeners then selected the segment in
each pair which sounded most like natural speech. The last
task again consisted of pairs of speech, but this time the
listeners were to determine if the


The results from the listener evaluation showed that the
synthetic speech differed substantially from its natural
counterpart, that is, they were distinguishable. The judges
demonstrated a strong preference


Chapter Summaries


The physiology of the vocal mechanism played an
important role in this study. Hence, many terms not normally
encountered in the engineering literature are employed
in this work. Chapter II, based on the glottal area
function, first describes the vocal mechanism and gives an
explanation of the technique used in this research to measure
the glottal area waveform. The chapter concludes with two
sections, one examining the filter model used to synthesize
speech and the other describing the synthesis problem of
this research. Chapter III is a historical overview of the
methods of speech synthesis. The purpose of this chapter is
to give a narrative on the development of speech synthesis

this field. The details for synthesizing speech are
discussed in Chapter IV. Some of the problems encountered in
this research are also described and analyzed. Chapter V
presents the results obtained from this investigation. The
first section of this chapter describes the resultant
synthetic speech of one subject. Graphs are plotted so that
the reader can compare the real with the synthetic speech.
The next section tabulates the results from the listener
evaluation of the normal speech and the third section presents
the results obtained from the pathological case. Chapter VI
discusses the results obtained from this research.

Five appendices have been included. Appendix A gives
a review of speech synthesis based on vocal cord modeling.
The purpose for including this in the dissertation is that
it has played a relevant part in the analysis of the results
of this research. Appendix B reviews the method of linear
prediction, which was used for formant extraction.
Appendices C and D give, respectively, a listing of the
main computer programs and a listing of the listener
evaluation form. Appendix E gives the synthetic speech data
for two of the normal subjects.

CHAPTER 


The physiology of the vocal mechanism plays an important
role in the design and implementation of speech synthesizers
and cannot be overemphasized. Most speech synthesizers, in
one way or another, model our vocal apparatus. There are
basically two types of modeling devices. One class views
the vocal apparatus as an input-output device. This type of
synthesizer models the overall characteristics of the vocal
mechanism but does not take into account structural details.
A second class of synthesizers, called articulatory or
transmission line synthesizers, models the vocal mechanism
by dividing it into sections. Each section is then modeled
by an electrical network.

The model of the vocal mechanism employed in this
research falls into the category of input-output devices. The
excitation function for the model is the glottal area waveform,
obtained from high-speed photography of the vocal cords. We
shall briefly describe the physiological aspects of the vocal
mechanism that played an important role in this study. This
chapter will, therefore, describe the human voice production
terminology.


he  glottal  a 


n speech  synthesis  a 




then discussed. The chapter concludes with a restatement

Definition of Terms

A schematic diagram of the human speech production
mechanism appears in Figure 1.2. Speech is produced by
the driving force of the lungs expelling air through the
bronchi, trachea, past the vocal cords, larynx, pharynx,

tract is defined to be the passageway starting at the
vocal cords and terminating at the lips and nose. The
different sounds are produced according to the physical
configuration of the vocal tract.

Speech sounds can be classified under two general
subheadings, voiced or unvoiced. Voiced sounds are always
produced by the vibratory action of the vocal cords. Thus,
all voiced sounds originate at the vocal cords. Voiced
sounds are called nasals if they are produced through the
nasal cavity (for example, [m]). Unvoiced sounds are
produced with the vocal cords spread apart by allowing either
of two things to occur. In the first, the air stream
passes through a narrow constriction in the vocal tract,

case, total closure occurs at some point in the vocal
tract, followed by a pressure buildup and its rapid release.
Examples of voiced sounds include all of the vowels.
Examples of unvoiced sounds are [s] (for the first case) and

general, 

*e  characterised 


properties  of  the  vocal 

(poles)  and  anti- 
which appear as spectral peaks in the frequency domain.

For nonnasal voiced sounds the vocal tract's transfer
function consists only of poles [1]. To obtain the
location of the formant frequencies consider


vocal tract of a male adult. For a first-order
approximation we can model the vocal tract by a uniform
acoustical tube closed at one end and open at the other. The
formants of this vocal tract model occur at the natural
frequencies F_n of the tube. In equation form,
F_n = (2n - 1)c/(4l), where c is the velocity of sound and l
is the length of the tube [4, p. 660]. For the normal adult male

0 to 5 kHz we thus have five equally spaced formants. For
the nonidealized case, however, the vocal tract is a
nonuniform acoustical tube and the formant location will depend
upon the particular configuration of the vocal tract.
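
As a worked example of the uniform tube formula
F_n = (2n - 1)c/(4l), the short Python sketch below lists the
resonances below 5 kHz. The tube length of 17 cm and the sound
velocity of 340 m/s are commonly quoted textbook values and
are assumptions here, since the corresponding figures are not
legible in this copy of the text.

```python
def uniform_tube_formants(length_m, sound_velocity=340.0, max_hz=5000.0):
    """Resonances F_n = (2n - 1) c / (4 l) of a tube closed at one end."""
    formants = []
    n = 1
    while True:
        f = (2 * n - 1) * sound_velocity / (4.0 * length_m)
        if f > max_hz:
            break
        formants.append(f)
        n += 1
    return formants


# Assumed values: 17 cm vocal tract, 340 m/s sound velocity
formants = uniform_tube_formants(0.17)
print([round(f) for f in formants])
```

Under these assumptions the resonances fall near 500, 1500,
2500, 3500 and 4500 Hz, that is, five formants spaced about
1 kHz apart in the 0 to 5 kHz range.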

The vocal cords are also referred to as the vocal folds
and the orifice between them is called the glottis. As the

variously referred to as the glottal area waveform, glottal
area function, or simply glottal area (the term vocal cord
is sometimes substituted in place of the word glottal). For
voiced sounds, the glottal area is a function of time, that is,

A(t) is oscillatory and quasi-periodic, that is, the basic
wave pattern tends to repeat itself with occasional variations
in amplitude and temporal fluctuations, also known as shimmer
and jitter, respectively. The closed phase and open period of
one glottal cycle (pitch period) are defined to be,
respectively, the intervals during which the glottis is closed
and open. The open period divided by the pitch period is
called the open quotient or duty cycle. A typical glottal
area waveform, measured from high-speed films of the vocal
cords, is illustrated in Figure 2.1.
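
The quantities just defined can be illustrated with a short
Python sketch. The text does not state how jitter and shimmer
were quantified, so the mean relative cycle-to-cycle change
used below is one common convention and is an assumption. The
sample values and function names are hypothetical.

```python
def open_quotient(area_cycle):
    """Fraction of samples in one pitch period with a nonzero glottal area."""
    open_samples = sum(1 for a in area_cycle if a > 0.0)
    return open_samples / len(area_cycle)


def mean_relative_perturbation(values):
    """Mean absolute cycle-to-cycle change divided by the mean value."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


# Hypothetical glottal area samples over one pitch period
cycle = [0.0, 0.0, 0.0, 0.0, 2.0, 5.0, 7.0, 5.0, 2.0, 0.0]

# Hypothetical pitch periods (ms) and peak areas of successive cycles
periods_ms = [8.0, 8.2, 7.9, 8.1]
peak_areas = [7.0, 6.8, 7.1, 6.9]

oq = open_quotient(cycle)                         # 5 open samples of 10
jitter = mean_relative_perturbation(periods_ms)   # temporal fluctuation
shimmer = mean_relative_perturbation(peak_areas)  # amplitude fluctuation
```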

During the production of voiced sounds puffs of air are
generated at the glottis as it opens and closes. The air
puffs pass through the vocal tract, the configuration of
which determines the particular sound emitted. In order to
characterize the flow of air through the vocal tract we
define volume velocity as the volume of air per unit time. If
u(t) is the particle velocity at the glottis, we have volume
velocity U(t) = u(t)A(t). From this equation we see that
the glottal volume velocity and area waveforms are related
by the time-varying factor u(t).

Volume velocity at the glottis is also referred to in
the literature as glottal flow. In this work, the term
volume velocity will always be taken to mean glottal flow.

The vibratory action of the vocal cords is caused by
a combination of subglottal air pressure and the Bernoulli
effect. Assume that the vocal cords are initially open.
As air is expelled from the lungs it passes through the
glottis, causing a drop in pressure.

the vocal cords are flexible, they are brought closer together
until full closure is achieved, at which time the Bernoulli
effect ceases and the subglottal pressure increases steadily,
forcing the cords apart. The cycle is then repeated,
inducing continuous oscillations at the vocal cords.


Measurement of the Glottal Area Waveform


There are various ways to measure the glottal area
waveform. These methods include motion picture photography,
glottography, fiber optics and radiographic techniques.
However, the method that allows a detailed examination of
vocal cord movement is that of high-speed photography [5].
This photography is currently being carried out at the
University of Florida's Department of Speech, Gainesville,
Florida. The glottal area waveforms used in this research
were obtained from high-speed films of the vocal cords.
Hence, we will limit ourselves to discussing this technique.
For a review of the previously mentioned techniques the
reader is referred to [6-7].

The advantages of high-speed photography of the vocal
cords are many. Parameters, such as vocal cord length,

detailed examination of the glottal vibratory pattern is
possible. The filming of a patient's vocal cords allows

ment and the pathologist can thus make a direct assessment
of vibratory pattern for clinical diagnosis.


There are, however, a number of difficulties involved
in the filming of the vocal cords. A laryngeal mirror is
inserted into the subject's mouth to reflect the vocal cord
image onto the camera lens. The subject must also hold his
tongue in a downward position while being photographed to
avoid getting the tongue between the mirror and the camera
lens. Many subjects cannot tolerate the laryngeal mirror
and most experience some discomfort. As a consequence the
subject can only phonate a limited number of voiced sounds.
The most common sound is [æ].

For this study, the sound emitted from the subject
while being photographed was simultaneously recorded on
magnetic tape. It was also possible to correlate the sound
source to the film, frame by frame, by having a small neon
light blur out a frame at the same time that a solenoid,
held next to the subject's throat, was activated to produce
a sharp sound picked up by the microphone. However, the
camera noise was extremely high and this segment could not
be used. A sample, recorded just prior to the camera being
turned on, was used in this study to avoid the noise.

A sampling rate problem arose as a direct result of
the variable camera speed and the fact that the rate at which
the camera sampled the area waveform was not equal to the
speech sampling rate (10 kHz). Figure 2.2 shows the
operating characteristic of the 16 mm Fastax camera used
to photograph the vocal cords. At no point is the camera
speed constant and the highest attainable camera speed is
approximately 4,600 frames/s. The film segment that was
picked for analysis was between 70 and 80 feet of film.
This interval was where the minimum camera speed
fluctuations occurred. The camera speed in this interval was
approximately 4,400 frames/s. After examining several
interpolation schemes, the Gregory-Newton difference
interpolation formula [8] was used to increase the sampling
rate of the glottal area waveform to an effective rate of
10 kHz.
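
The rate conversion described above can be sketched in Python
with a second-order (parabolic) Gregory-Newton forward
difference formula. This is an illustration under simplifying
assumptions: the camera rate is treated as a constant
4,400 frames/s, and the function names and the three-point
window are choices made here, not details taken from the
original programs.

```python
def newton_forward_parabolic(y0, y1, y2, s):
    """Second-order Gregory-Newton forward-difference interpolation.

    s is the distance from the first sample, in units of the
    input sample spacing.
    """
    d1 = y1 - y0               # first forward difference
    d2 = y2 - 2.0 * y1 + y0    # second forward difference
    return y0 + s * d1 + 0.5 * s * (s - 1.0) * d2


def resample(samples, rate_in=4400.0, rate_out=10000.0):
    """Resample a uniformly sampled sequence (at least 3 points)."""
    n_out = int((len(samples) - 1) * rate_out / rate_in) + 1
    out = []
    for k in range(n_out):
        pos = k * rate_in / rate_out           # position in input samples
        i = min(int(pos), len(samples) - 3)    # first of three neighbours
        out.append(newton_forward_parabolic(
            samples[i], samples[i + 1], samples[i + 2], pos - i))
    return out


# Hypothetical glottal area samples taken at the camera frame rate
area_4400 = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
area_10k = resample(area_4400)
```

Because the example samples lie on a parabola, the interpolated
values fall exactly on the same curve, which gives a simple
check of the formula.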

Usually, the epiglottis covers a small portion of the
vocal cords anteriorly and the arytenoid cartilages also

posteriorly.

cords sometimes blocks the glottal

opening, introduced some difficulties in the measurement
of the area function from the film.

high-speed films is long and tedious. The glottal opening,
first photographed on high-speed film, is traced on paper
and from this the area is measured manually by using a
mechanical instrument called a planimeter. Assuming that
the person using the planimeter is highly trained, the
calculation of the area can be done at the rate of
10 frames/hour. Recent techniques, developed at the
University of Florida's Department of Electrical Engineering,
have improved the efficiency of the area measurement. Here,
a minicomputer and other hardware are used to measure the
glottal area by semiautomated means. Rates of 120 frames/hour
are now possible. This is slightly over an order of magnitude
faster than the planimeter method, making larger data
analysis more feasible.

The equipment, illustrated in Figure 2.3, includes a
Nova 2 minicomputer, a Tektronix 4010 CRT display terminal,
a Graf/pen, and supporting software. The Graf/pen consists
of a stylus and two linear 15" microphone sensors mounted on
a clear acrylic tablet. The microphones and the stylus are
connected to a control unit which is interfaced with the
computer. The two microphones provide the x and y
coordinates of the stylus, which generates sonic impulses
whenever pressure is exerted. A 16 mm movie single-frame
projector projects the film onto a piece of translucent paper
placed on the acrylic surface. The operator traces the
glottal opening using the Graf/pen stylus. The sonic impulses
emitted by the stylus are picked up by the x-y microphones
and the glottal outline is simultaneously displayed on the
terminal. If the operator is satisfied with the outline
he presses any key, except zero, on the display terminal.
The area for the displayed frame is calculated and stored
in a disk file. The operator then advances the projector
to the next frame. If the operator is not satisfied with the
displayed glottal outline he presses the zero key on the
terminal. This tells the computer to reject the frame and
start over again. Once on disk, the data can be processed
to extract parameters such as jitter, shimmer, open quotient,
closed phase, etc. The glottal area waveform illustrated in
Figure 2.1 was obtained by using this technique.
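
The text does not say how the area of a traced frame was
computed. One simple possibility, shown here purely as an
assumption, is the shoelace formula applied to the (x, y)
points reported by the tablet. The function name and the
sample outline are hypothetical.

```python
def polygon_area(outline):
    """Area enclosed by a closed outline of (x, y) vertices (shoelace)."""
    twice_area = 0.0
    # Pair each vertex with the next one, wrapping around to the first
    for (x0, y0), (x1, y1) in zip(outline, outline[1:] + outline[:1]):
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


# Hypothetical traced outline in tablet units (a 4 by 3 rectangle)
traced = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
print(polygon_area(traced))  # prints 12.0
```

The absolute value makes the result independent of whether the
operator traces the outline clockwise or counterclockwise.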

The measurement of the glottal area waveform, using
the hardware described above, still requires a significant
amount of operator intervention. The anterior and posterior
portions of the vocal cords are usually occluded by the
epiglottis and the arytenoid cartilages, respectively. The
operator can either omit these portions or make an attempt
to extrapolate the glottal opening.

interfere with the glottal opening. At times the film is not
clear and it is somewhat difficult for the operator to locate
the edge of the vocal cords. All of these factors can produce

form. Present research being conducted at the University of
Florida's Departments of Electrical Engineering and Speech is
aimed at overcoming these difficulties.


Speech  Synthesis  Based  on  Glottal  Area  Waveform  Excitation 

Investigations have already been performed in modeling
the vocal tract and synthesizing speech by exciting the model
with various waveforms. Rosenberg [9] and Holmes [10] have
both attempted to synthesize speech from vocal tract models
by driving them with different glottal waveforms. Triangular,
sinusoidal and polynomial waveforms were used. Based on
listener evaluations both of these investigators reported
that the synthetic speech obtained had lost its naturalness
as compared to real speech. However, the glottal area
waveform was not used. In fact, the current literature does not
seem to present any information on synthetic speech derived
from glottal area waveform excitation. There are various
reasons for this. Most researchers involved in speech
synthesis are generally interested in finding efficient ways
to synthesize speech. Their main goal is an overall
representation of the speech signal and not a detailed one.
Specialized equipment is needed to measure the vocal cord
vibratory pattern. Currently, there are not more than 5
places in the United States actively involved in vocal cord
high-speed photography [11].


Recently, vocal cord models have been developed by
Flanagan et al. [12-17] (see Appendix A). The models are
based on physiological constraints allowing one to make

The vocal cord models made use of the glottal area
waveform, derived from a system of differential equations, t

the fundamental differences between the glottal area
waveform and volume velocity. The results from Flanagan's

explain some of the results that were obtained in t


Statement  of  the  Research  Problem 

It has been reported by van den Berg et al. [18] that
the volume velocity at the vocal cords is directly
proportional to glottal area for large openings and to the cubic
power of area for small glottal openings. In their one- and
two-mass models of the vocal cords, Flanagan et al. [14-17] obtained
noticeable detailed differences between the volume velocity
and the glottal area waveforms. The area waveform was
generally smoother and more symmetric than the volume velocity.

shape of the volume velocity waveform and had little
influence on the overall shape of the area waveform. Source-tract
interaction was particularly evidenced by temporal
detail  on  the  volume  velocity  waveform  due  to  the  first  for- 


CHAPTER 


An investigation of any subject is not complete without
a knowledge of its historical aspects. The purpose of
this chapter is to present an overview of the development
of speech synthesis and at the same time to review the
pertinent literature.

Speech synthesis can be traced as far back as the late
1700's. The objective in speech synthesis is to find a

standing systems hope to go beyond recognition. Multiple
devices and algorithms have been employed in an effort to
synthesize speech. Basically, the main goal in speech
synthesis is to find an adequate model to represent the vocal
tract. Devices have ranged from primitive mechanical
synthesizers to sophisticated electronic ones, and it is now
possible, with the aid of high-speed digital computers, to
synthesize speech using programmed algorithms. Implementing
a mathematical model of the vocal tract in software
has indeed been a significant breakthrough.

Von Kempelen's talking machine in the late 18th century
was one of the first mechanical devices to successfully
synthesize speech [19-21]. An earlier device due to
Kratzenstein [19] was able to imitate the five vowels of the
English language. It consisted of a set of five acoustic
resonators, one for each vowel sound, where the form of
each resonator resembled the human mouth cavity for the
sound being emitted. Although Kratzenstein's resonators
did work, they were not as sophisticated in nature as von
Kempelen's machine, which could actually reproduce various


A modified version of von Kempelen's speaking machine
was built by Sir Charles Wheatstone in 1835 [19-21], and
this in turn inspired Alexander Graham Bell during the late
1800's to construct a cast model of the human head which
incorporated lips, tongue, palate, teeth, pharynx and velum
made out of various different materials. Probably the most
remarkable aspect was that it worked. Bell's device was
able to produce voiced sounds and some simple utterances.

The  years  that  followed  brought  no  major  contributions 
until  Sir  Richard  Paget  in  the  1920's  synthesized  speech 
sounds  using  shaped  resonant  cavities  molded  in  plasticine 
excited  by  vibrating  reeds.  He  showed  that  simple  s 
could be produced by his innovation. He simply used his
hands to form different cavity shapes. This resulted in


particular  cavity  shape. 

There  were  probably  many  more 
reported  In  the  literature  during  t 
century  and  the  first  two  decades  i 


century. All of these devices were mechanical in nature, which
made them somewhat cumbersome to handle. Nevertheless, it is
still amazing how many of these contraptions were even able to
produce simple sentences considering their many limitations.
Much like a musical instrument, the quality of speech sounds
produced by these devices always depended on how good the operator

The transition between mechanical and electrical
synthesizers is marked by two events. The first electrical
analog of the vocal organs was invented by
Stewart [22] in 1922. Bell Telephone exhibited at the 1939
New York World's Fair the Voder (voice-operation demonstrator),
invented by Homer Dudley [23-24], which could produce
intelligible speech. Although Stewart's analog device could
produce vowel-like sounds, the first electrical synthesizer
to actually produce connected speech was the Voder [25]. The
Voder consisted of ten contiguous band-pass filters, which
represented the vocal tract in terms of formant frequencies,
and contained two different input sources. One was a white
noise source, to simulate fricatives, and the other a periodic
buzz oscillator, to simulate voiced sounds. The end result
was that there were about thirteen hand-operated keys, one
foot pedal for source pitch control and even a wrist bar for
the selection of voiced or unvoiced sounds. Indeed, the
operator had to be an expert. As a matter of fact, the
operators had to be trained for over a year before learning to
use this machine. Thus, although electrical in


nature, the Voder had the same major drawback as its mechanical
predecessors: to be used properly, it took an operator with
considerable experience to cope with the machine. However, the
importance of the Voder and Stewart's electrical analog device is
evident at this point. First, they filled the gap between
mechanical and electrical modeling and second, their success


proved that intelligible speech could be produced by
electrical means, thus laying the foundations for present-day
speech synthesis. The many devices that shortly followed,
such as Dudley's Vocoder [24], came as a direct result of


In 1950 Harris [26], then at Bell Laboratories, designed
a mechanical-electrical speech synthesizer. It was probably
one of the first machines to synthesize speech from speech
segments stored on magnetic tape. After a certain number
of keys were struck, each one corresponding to a particular
speech sound, the speech synthesizer would chain all of the
sounds together. A time delay was incorporated near the
output stage so that the segments would be somewhat continuously
connected. Figure 3.1 shows the main idea behind this. The
end result was what Harris called standardized speech [27],

of speech. Although still primitive in nature, his device
set forth the initial stages for speech synthesis by rule
and at the same time made it clear that it was possible to
fully automate the process of synthesizing speech.


During the 1950's researchers started implementing
electrical analog devices of the vocal tract. The main
idea of these devices was to approximate the characteristics
of the vocal tract by electrical networks supplemented by
external control dials to change their parameters. Two different
types of devices for modeling the vocal organs were developed.

The first of these, called articulatory or transmission line
analog synthesizers [1,21], view the vocal tract as a
transmission line. The model consists of dividing the vocal tract
into many sections and then designing a simple electrical
network for each section. Each section simulates the distributed
properties of a small part of the vocal tract. When put in
cascade the end result is a direct representation of the


The first articulatory synthesizer is probably due to Dunn
01.  Dunn's  device  consisted  of  25  uniform  T-sections  w 

the  vocal  tract.  A similar  passive  device,  due  to 


Stevens, Kasowski and Fant [29], made a major improvement in
that their device provided adjustments to simulate a range


thirty-five  pi-sections, 


The second class of models, called resonance, formant
or terminal-analog synthesizers [1,21,30], view the vocal
tract as an input-output device. The goal here is to model
the device so that its transfer function will be a close
correlate to the formant structure of the actual vocal tract.
The earliest such device is due to Stewart in 1922 [22], but
it was not until 1953 that the first model to synthesize
complete sentences was developed by Lawrence [31]. Currently
a multitude of such devices is under study and development
[32-34].

At  this  point  it  is  safe  to  say  that  the  turning  point 
of  speech  synthesis  has  been  due  to  the  development  of  the 
digital  computer.  Speech  analysis  and  synthesis,  whether  in 

seconds making it possible to perform operations in real time.
Such  new  methods  as  speech  synthesis  by  rule,  by  homomorphic 
filtering,  by  analysis-by-synthesis,  and  by  linear  predic- 
tion have  been  developed  primarily  as  a consequence  of  the 
feasibility  to  implement  them  by  computer. 

The first method just mentioned, speech synthesis by
rule, originated from Harris' classical paper [27] dealing
with what he called standardized speech. Basically, the

them  to  operate  on  a set  of  discrete  symbolic  inputs  such 
as  phonetic  symbols,  punctuation,  and  printed  text.  Al- 
though the  speech  obtained  by  some  investigators  [26,35] 
through  this  method  has  been  highly  intelligible,  the  main 


drawback  i 


3 frequency  band,  from 
lows  a deeoriptlon  of 
this  type  of  synthesis 


5 unnatural  so 

n advantages  ar 
tion  rate  [36]  and  that  in  using  a s 

such  a system,  K good  application  o 
is in the information answering services, whereby one picks
up the telephone, dials a reference number and asks a
question. The computer processes the (spoken) question and

Homomorphic analysis-synthesis of speech was first
presented in two published works [37-38] in 1968-69. Figure
3.3 illustrates the synthesizer's configuration. The
low-time cepstral information C(nT) is actually the windowed
inverse Fourier transform of the log magnitude of the Fourier
transform. Figure 3.4 shows this implementation in simple
block diagram form. Oppenheim reports [38] that this system
can produce high quality natural speech. We also see
that homomorphic filtering is related to cepstrum analysis.
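
The low-time cepstral windowing described above can be sketched in a few lines. The following is a minimal illustration only, not the implementation of [37-38]: the function names, the naive DFT, the rectangular low-time window and the `cutoff` parameter are all assumptions made for the example.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for m in range(n):
            acc += x[m] * cmath.exp(sign * 2j * math.pi * k * m / n)
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame, floor=1e-12):
    """Inverse DFT of the log magnitude of the DFT of `frame`."""
    log_mag = [math.log(max(abs(v), floor)) for v in dft(frame)]
    return [v.real for v in dft(log_mag, inverse=True)]


def low_time_lifter(cepstrum, cutoff):
    """Keep quefrencies below `cutoff` (and their mirror images); zero the rest."""
    n = len(cepstrum)
    return [c if (i < cutoff or i > n - cutoff) else 0.0
            for i, c in enumerate(cepstrum)]


def smoothed_log_spectrum(frame, cutoff):
    """Log-magnitude envelope recovered from the low-time cepstrum."""
    liftered = low_time_lifter(real_cepstrum(frame), cutoff)
    return [v.real for v in dft(liftered)]
```

In this sketch the low-time portion of the cepstrum stands in for the vocal tract contribution; the choice of `cutoff` (hypothetical here) would have to be smaller than the pitch period expressed in samples.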

The  method  of  linear  prediction  and  speech  synthesis 
based  on  vocal  cord  modeling  are  described  in  Appendices  B 
and  A respectively.  They  are  mentioned  here  only  briefly 
to  place  them  in  perspective  with  the  previous  discussion. 

The linear prediction method was studied by Markel [39],
Itakura and Saito [40], Atal and Hanauer [41], and a complete
review was presented by Makhoul and Wolf [42] and Markel and


Figure: speech synthesis by rule (after [35]).


Linear prediction can actually be compared to the method
of spectral analysis-by-synthesis first introduced at M.I.T.
and Bell Telephone Laboratories in 1961 [43]. The
analysis-by-synthesis technique tries to find a spectral fit to the
speech spectrum that consists of poles and zeros. An error
is then defined and the fit optimized by minimizing the error
in the mean-squared sense. The main difference between the
two formulations is that in spectral analysis-by-synthesis
the error minimization is done on the integrated logarithm
of the ratio between the model and actual spectra whereas
in linear prediction it is done on the integrated ratio of
the two spectra. In general, the linear prediction error
leads to a better spectral envelope fit [44].
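
The contrast between the two error criteria can be written out explicitly. The expressions below are an illustrative sketch in commonly used notation, not formulas taken from [43] or [44]: P(ω) denotes the actual speech spectrum and P̂(ω) the model spectrum, and both the direction of the ratio and the squaring of the logarithm are assumptions.

```latex
E_{\mathrm{LP}} = \frac{1}{2\pi}\int_{-\pi}^{\pi}
    \frac{P(\omega)}{\hat{P}(\omega)}\, d\omega ,
\qquad
E_{\mathrm{AbS}} = \frac{1}{2\pi}\int_{-\pi}^{\pi}
    \left[ \log \frac{P(\omega)}{\hat{P}(\omega)} \right]^{2} d\omega
```

Under these assumptions the ratio criterion penalizes regions where the actual spectrum exceeds the model more heavily than regions where it falls below it, which is one way to understand why the linear prediction fit tends to follow the spectral envelope.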


Figure 3.3 Representation of the synthesizer configuration
(after [30]).


Essentially, the method of linear prediction models
the vocal tract and the overall effects of radiation at the
mouth and nostrils and of glottal excitation by an all-pole
filter. This filter has the form


where z = e^(sT), s = σ + jω is a complex frequency variable,
T = 1/f_s is the sampling period and f_s is the sampling
frequency.
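
As an illustration of how an all-pole filter of this kind can be obtained from sampled speech, the sketch below uses the autocorrelation method with the Levinson-Durbin recursion. It is a minimal example under stated assumptions, not the program of this research: the denominator convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, the function names and the unit default gain are all choices made for the example.

```python
def autocorrelation(x, max_lag):
    """Short-time autocorrelation r[0..max_lag] of the frame x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (a, err) with a[0] = 1 and A(z) = 1 + sum_k a[k] z^-k.

    Assumes r[0] > 0 (a frame that is not identically zero).
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[k] * r[i - k] for k in range(1, i))
        k_i = -acc / err                      # reflection coefficient
        new_a = a[:]
        for k in range(1, i):
            new_a[k] = a[k] + k_i * a[i - k]
        new_a[i] = k_i
        a = new_a
        err *= (1.0 - k_i * k_i)              # residual prediction error
    return a, err


def all_pole_filter(excitation, a, gain=1.0):
    """y[n] = gain * x[n] - sum_{k=1..p} a[k] * y[n-k]."""
    y = []
    for n, x in enumerate(excitation):
        acc = gain * x
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y
```

For a frame of speech `frame` and filter order `p`, `levinson_durbin(autocorrelation(frame, p), p)` would give the predictor coefficients, and `all_pole_filter` applies the resulting filter to an excitation sequence.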

We see that the linear prediction model is a unique
one since it yields a single filter with all of the above
mentioned properties. The fact that the filter does not




METHODS OF PROCEDURE

There were two major parts in this research. One was
the development and implementation of the synthesis
technique employing the glottal area waveform to excite a
digital filter model of the vocal tract. The other was the
listener evaluation of the synthetic speech. This chapter
discusses the details of both of these procedures.

Procedure for Synthesizing Speech

Three normal male subjects were filmed for this research.
They were asked to phonate a sustained |i| at their normal
speaking frequency during the filming process (the sound
that was actually voiced was an |ae|, which is believed due
to the laryngeal mirror inserted into the subject's mouth).
However, two of the subjects' vocal cords, while phonating
at their normal speaking frequency, were not sufficiently
exposed to allow an adequate view for photographic purposes
(remember, the epiglottis covers the cords anteriorly, and
the arytenoids posteriorly). The trade-off was to increase
their fundamental frequency and thus obtain a better view
of the vocal cords at the cost of a decrease in temporal
resolution per pitch period of the area waveform.



Ideally, the natural speech segment used should have
been the one obtained simultaneously with the filming.
However, camera noise was extremely high, making it impossible
to use this segment. The subjects were thus asked to
phonate before the initiation of the camera and this seg-

was  taken  to  assure  that  this  speech  sample  had  the  same 
pitch  period  and  intensity  level  as  that  during  the  filming. 

The natural speech was lowpassed with a cutoff at 5 KHz,
sampled at 10 KHz directly off the magnetic tape, and stored
on  a computer  disk  file  for  further  processing.  This  allowed 
us  to  derive  the  vocal  tract  filter  and  to  extract  the  jitter 
and  shimmer  parameters. 

The method of linear prediction [3,42] was used to extract

(the linear prediction method is discussed in Appendix B).

The filter, as derived by linear prediction, could not be
used directly since it jointly models the vocal
tract and the effects of glottal excitation and of radiation
at the mouth and nostrils.

The  flow  diagram  in  Figure  4.2  illustrates  the  method 
employed  in  this  research  to  construct  the  filter  model  of 
the  vocal  tract.  Initially,  the  input  parameters  needed 
were  approximately  3 pitch  periods  of  speech  data  and  the 
order  p of  the  linear  prediction  filter.  The  linear  pre- 
diction filter  was  then  derived  and  its  denominator  fac- 


minicomputer. Appendix C gives a source listing followed

t the program.

To illustrate the vocal tract modeling technique
developed in this research, consider the plots shown in
Figure 4.3. Figure 4.3a is the frequency response of a
16th order linear prediction filter derived from an 18 ms
speech segment of subject J3. Essentially, the linear
prediction filter models the envelope of the speech
spectrum [44,47]. This is why it contains the effects of
glottal excitation and of radiation. This is apparent in
Figure 4.3a from the additional peaks present in
the filter's frequency response. The first 4 formant
frequencies were kept, reducing the order of the filter
to p = 8, and the formant bandwidths were altered (usually

linear prediction model). The resultant frequency response of
the modified filter is shown in Figure 4.3b.
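
The reduction to an 8th order filter built from 4 formant pole pairs can be illustrated as follows. This is a sketch under stated assumptions rather than the program listed in Appendix C: it starts from formant frequencies and bandwidths instead of factoring the linear prediction polynomial, it uses the common mapping from bandwidth to pole radius, r = exp(-πB/f_s), and all function names are hypothetical.

```python
import math


def formant_section(freq_hz, bandwidth_hz, fs_hz):
    """Denominator [1, b1, b2] of one conjugate pole pair (one formant)."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)   # pole radius from bandwidth
    theta = 2.0 * math.pi * freq_hz / fs_hz          # pole angle from frequency
    return [1.0, -2.0 * r * math.cos(theta), r * r]


def poly_multiply(p, q):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pv in enumerate(p):
        for j, qv in enumerate(q):
            out[i + j] += pv * qv
    return out


def vocal_tract_denominator(formants, fs_hz):
    """Cascade one second-order section per (frequency, bandwidth) pair."""
    denom = [1.0]
    for freq_hz, bandwidth_hz in formants:
        denom = poly_multiply(denom, formant_section(freq_hz, bandwidth_hz, fs_hz))
    return denom
```

With 4 formants the returned list has 9 coefficients, i.e. a denominator of order p = 8. Altering the bandwidth argument of a pair changes only the radius of that pole pair, which is one way the formant bandwidths could be adjusted independently of the formant frequencies.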

The jitter and shimmer parameters introduced into the
synthetic speech were obtained from the original speech
segment. There are multiple ways to derive these parameters
from the speech signal. No one method seems to be optimum.

waveform  by  calculating  the  difference  of  the  maximum  ampli- 
tudes in  successive  pitch  cycles.  The  jitter  may  be  found 
by  first  defining  the  period  of  each  cycle  of  the  wave- 
form to  be  the  distance  in  time  between  the  peak  amplitudes 
of C_j and C_{j+1}. If the speech waveform consists of N cycles




Figure 4.3 Frequency response of a) linear prediction filter,
p = 16 and b) modified filter, p = 8.


where  Tj  is  the  jth  period  and  is  the  jitter  parameter. 
However,  for  sampled  speech  these  parameters,  as  defined 
above,  were  found  to  be  inadequate  when  incorporated  into 
the  synthetic  speech.  The  problem  was  that  the  sampled 
peak  amplitude  in  each  cycle  was  not  a sufficiently  good 
approximation  to  the  actual  peak  amplitude,  especially  for 
the  higher  frequency  speech  waveforms.  This  introduced 
abrupt  variations  in  the  shimmer  and  jitter  between  suc- 
cessive pitch  cycles  causing  the  synthetic  speech  to  sound 
unnaturally harsh [48-49].
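
The peak-based definitions discussed above can be sketched as follows. Since the defining equations are not legible in this copy, the sketch assumes that shimmer is the difference of peak amplitudes in successive cycles and that jitter is the difference of successive periods, each period being the time between the peaks of adjacent cycles. The function names and the `boundaries` argument (sample indices delimiting the cycles) are hypothetical.

```python
def peak_of_cycle(samples, start, end):
    """Index and value of the largest sample in samples[start:end]."""
    idx = max(range(start, end), key=lambda i: samples[i])
    return idx, samples[idx]


def peak_jitter_shimmer(samples, boundaries, fs_hz):
    """Peak-based jitter (seconds) and shimmer (amplitude units) per cycle."""
    peaks = [peak_of_cycle(samples, boundaries[j], boundaries[j + 1])
             for j in range(len(boundaries) - 1)]
    amplitudes = [amp for _, amp in peaks]
    # Period T_j: time between the peaks of cycles j and j+1.
    periods = [(peaks[j + 1][0] - peaks[j][0]) / fs_hz
               for j in range(len(peaks) - 1)]
    shimmer = [amplitudes[j + 1] - amplitudes[j]
               for j in range(len(amplitudes) - 1)]
    jitter = [periods[j + 1] - periods[j] for j in range(len(periods) - 1)]
    return jitter, shimmer
```

Because the peak is located only to the nearest sample, both quantities are quantized to the sampling grid, which is consistent with the abrupt cycle-to-cycle variations reported above for sampled speech.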

The  jitter  and  shimmer  parameters  employed  in  this 
research  were  found  by  a correlation  analysis  on  the  speech 
waveform,  assuring  a smoother  variation  in  these  parameters 
from  cycle  to  cycle.  Figure  4.4  illustrates  this  method 
and  the  manner  in  which  these  parameters  were  incorporated 
into  the  synthetic  speech.  The  shimmer  was  derived  by  cal- 
culating the  power  in  each  speech  cycle  and  multiplying  this 
by the corresponding glottal cycle. Mathematically, if P_j
is the power in the jth speech cycle and g_j is the jth
glottal cycle, which is a function of time, we have


is  the  glottal  area  waveform  w 


letting g_j = g(nT)


where  T is  the  pitch  period  of  g(nT). 

The jitter parameter was derived by considering 3
successive speech cycles C_j, C_{j+1}, and C_{j+2}. The cross-

correlation  functions  and  tj+j  were  found,  respectively. 


^jtl'  “ 


•-j*l 


J+2' 

...  respectively. 


1 jth  glottal 


form. The jitter parameter was incorporated into the
waveform by augmenting the closed phase of each
cycle by the jitter parameter. A FORTRAN program, called
JITSHIM (due to Dr. A. Paige, Department of Electrical
Engineering faculty member, University of Florida), was
modified and used to extract the jitter and shimmer
parameters from the speech waveform. The source listing
appears in Appendix C.
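
The correlation-based approach can be illustrated with a short sketch. It is not a transcription of JITSHIM, whose details are given only in Appendix C: the search over a lag range, the unnormalized correlation and all function names are assumptions made for the example. The period estimate is taken as the lag that maximizes the correlation between one cycle and the waveform one cycle later, and the shimmer scaling multiplies a glottal cycle by the power of the corresponding speech cycle, as described in the text.

```python
def cycle_power(cycle):
    """Mean squared value of one speech cycle."""
    return sum(v * v for v in cycle) / len(cycle)


def best_lag(signal, start, length, min_lag, max_lag):
    """Lag in [min_lag, max_lag] that maximizes the correlation between
    signal[start:start+length] and the same-length window `lag` samples later.
    Requires start + max_lag + length <= len(signal)."""
    def corr(lag):
        return sum(signal[start + i] * signal[start + lag + i]
                   for i in range(length))
    return max(range(min_lag, max_lag + 1), key=corr)


def scale_glottal_cycle(glottal_cycle, power):
    """Shimmer: weight the jth glottal cycle by the power of the jth speech cycle."""
    return [power * g for g in glottal_cycle]
```

Differences between the lags found for successive pairs of cycles would then play the role of the jitter parameter. Because each estimate uses every sample of a cycle rather than a single peak sample, the resulting values vary more smoothly from cycle to cycle.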

was applied to the volume velocity wave from the vocal tract
filter model as shown in Figure 4.5. In general, a value of

radiation is a good one for frequencies below 4 KHz [1,50].

The effect of (1 - z^-1) is to differentiate the volume velocity
wave U^(nT). Alternatively, the effect is that of high pass
filtering where the higher frequency components are emphasized
and the lower ones attenuated.
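
In the time domain the radiation factor is a first difference. The sketch below is a minimal illustration, with the function name and the zero initial condition as assumptions; the constant `a` generalizes the factor to (1 - a z^-1).

```python
def radiation_filter(volume_velocity, a=1.0):
    """Apply (1 - a z^-1): y[n] = u[n] - a * u[n-1], with u[-1] taken as 0."""
    out = []
    previous = 0.0
    for u in volume_velocity:
        out.append(u - a * previous)
        previous = u
    return out
```

A constant input is removed entirely when a = 1, while rapid sample-to-sample changes pass through, which is the high pass behaviour described above. Values of a below 1 give a milder emphasis of the high frequencies.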


In the first task the listeners were presented individual
segments of synthetic and natural speech. All 6 segments
of speech were used (the 3 synthetic segments with their 3
corresponding natural counterparts) and each segment was
repeated 15 times for a total of 90 speech segments. The
segments were distributed in random order and the spacing
between each segment was 4 s. The 90 segments were divided
into 5 groups, each containing 18 speech segments. The
spacing between each group was 15 s. After hearing each
segment the listeners responded whether they felt the
segment sounded natural or unnatural. The purpose of this
task was to judge the quality of the synthetic speech. The
natural speech segments were also included to minimize
listener bias and to learn how natural the natural speech
itself was judged to be.

The  second  task  consisted  of  96  pairs  of  speech  segments. 

each  pair  were  5 s and  .6  s,  respectively.  Each  pair  was 
composed  of  a synthetic  speech  segment  along  with  its  natural 
counterpart  (the  speech  segment  recorded  during  the  filming 
session).  The  96  pairs  were  broken  into  8 groups  of  12  pairs 
and  were  randomly  distributed.  The  spacing  between  the  groups 
was 15 s. For this event the listener's task was to select

speech.  Thus,  the  objective  in  this  task  was  to  compare  the 
synthetic  speech  to  the  natural  speech  and  to  learn  whether 
the  synthetic  speech  had  sufficient  natural  qualities  to  make 
it indistinguishable (in terms of naturalness) from natural


The third task was similar in its structure to the
second  task.  The  difference  was  that  24  pairs  were  used 
(2  groups  of  12)  and  the  listener's  task  was  to  reply 
whether  the  samples  within  each  pair  sounded  the  same  or 
different. 


The first task can be compared to a yes-no experiment
and the second task to a two-alternative forced-choice
(2AFC) procedure [51-52]. In a yes-no task the listener
is  presented  a single  stimulus  and  must  make  a decision  on 

procedure  the  listener  is  presented  with  2 samples  and  must 
select  which  of  the  2 contains  the  desired  characteristics. 
The  formal  listener  evaluation  session  was  held  in 

A total of 6 female students, between the ages of 20 and 26,
from  speech-related  sciences  participated  in  the  experiment. 
All  of  the  students  were  at  the  advanced  undergraduate 
level  and  all  had  some  previous  experience  in  clinical  work 
dealing  with  voice  disorders. 

The listeners were seated in a quarter-circle and the
speaker (medium-sized Advent speaker) was placed in the

sound  pressure  meter  was  used  to  assure  that  the  speaker 

(60-65 db). In addition, the amplifier (Kenwood, model
KA-7002) was equipped with filters and the signal from the
tape recorder (Revox 77A) was highpassed with the cutoff


An evaluation form was handed to each of the 6 listeners
(a listing of the evaluation form appears in Appendix D) and
separate instructions were given before each task. The
instructions included playing the first 3 segments of each
task as an example. This was done to avoid having to delete
the first few segments of each task in the final tabulation
of the results (the first few samples are generally
eliminated because they usually do not reflect subject criteria).
For the first task, the listeners were instructed to circle
N on their evaluation form if the speech segment sounded
natural or U if it sounded unnatural. The listeners were
instructed,  for  the  second  task,  to  circle  the  element  in 
the  pair  (a,b)  that  sounded  most  like  natural  speech.  For 
the  third  task,  they  were  instructed  to  check  off  on  their 
evaluation  form  the  word  same  if  the  samples  within  the 
pair  seemed  to  be  identical,  and  the  word  different  if  they 
felt  the  samples  were  distinguishable. 


Procedure for Pathological Case


Glottal area waveforms, derived from both normal and
pathological larynxes, have been studied in a limited way
[7,53-55]. The limited amount of data collected is due
primarily to the difficulties involved in accessing the
vocal cords. For the case of high-speed photography the

photographing  a pathological  case  is  made  even  more  dif- 
ficult due  to  the  pathology  itself.  Usually,  hours  of 


Much of the experimental work in this investigation
was on a trial and error basis. Several methods for
incorporating jitter and shimmer into the synthetic speech
were tried, formant bandwidths were altered, various
techniques to interpolate the glottal area waveform were tried,
and the constant a in the radiation factor (1 - a z^-1) was
optimized based on a subjective criterion. It was not
feasible to exhaust all possibilities. However, the
procedure followed was not a random one and, whenever possible,
theoretical explanations were sought to describe
experimental phenomena.

The  method  used  to  measure  the  glottal  area  was  de- 
scribed in  Chapter  II.  Basically,  the  system  employed  was 
semiautomated  whereby  the  operator  traced  the  glottal  opening 
using  a specialized  stylus,  and  the  computer  automatically 
calculated  the  area  and  stored  it  on  disk.  Other  methods 
have been employed here (at the University of Florida) [56-58]
and elsewhere [59-60]. Presently, work is being done here


glottal 


and shimmer parameters. The resultant waveform was used
to  synthesize  speech. 

The  glottal  area  pitch  period  employed  for  subject  JH 
and  its  corresponding  FFT  are  shown,  respectively,  in 


Figures 5.1a and 5.1b. The area waveform shown in this
figure was interpolated; thus, every frequency component

4,400 frames/s, is the result of the interpolation process.
The interpolated glottal period and its FFT are shown,
respectively, in Figures 5.1c and 5.1d.

The linear prediction filter was then designed from a
24 ms segment of the original speech record. From this
model, the first 4 formant conjugate pairs of poles were
extracted and a new filter formed. This was accomplished
by factoring the linear prediction filter model and keeping
only the formant resonant poles. The frequency responses
of the linear prediction filter and the modified vocal tract
filter are illustrated in Figure 5.2. The amplitude
spectrum of the linear prediction filter falls off about 6
db/octave faster than that of the modified model. The reason
is that the linear prediction filter contains the overall
effects of the glottal source and of radiation. The effect
of glottal excitation is to produce a falloff of 12 db/octave
[61]. Radiation produces a rise of 6 db/octave. Thus, the
combined effects of both source and radiation, not present
in the modified filter model, account for the additional
falloff of the linear predictor filter.
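
The arithmetic behind the 6 db/octave difference can be written out as a short worked equation, using only the two slopes quoted above (the underbrace labels are added for illustration):

```latex
\underbrace{-12}_{\text{glottal source}}
\;+\;
\underbrace{6}_{\text{radiation}}
\;=\; -6 \ \text{dB/octave}
```

That is, the source and radiation together contribute a net falloff of 6 db/octave to the linear prediction spectrum, and this contribution is absent once only the formant poles are retained.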


Figure 5.2 Frequency response of a) linear prediction filter
of order p = 14, and b) modified vocal tract filter.


It was determined, by informal listening, that
the value of a in the radiation factor (1 - a z^-1) that
produced the most natural sounding speech for subject JH was
a = .7. The value a = 1, which is normally employed to
simulate the effect of radiation, could not be used because
this would have overemphasized the higher frequencies. The
high-frequency components (greater than 2.2 KHz) were a
direct result of interpolating the glottal area waveform.
These components made the synthetic speech sound highly
unnatural. A trade-off was thus made to use the value a = .7,
for subject JH, at the cost of not fully incorporating the
effects of radiation into the speech signal.

The natural speech and synthesized speech of subject JH
are shown, respectively, in Figures 5.3a and 5.3b. Rather

more enlightening to compare their spectra. The reason for
this is that the natural speech had a significant amount of
phase distortion as a result of the process of recording it
and playing it back in order to digitize it. Figures 5.4a
and 5.4b show, respectively, the spectra of the real and
synthetic speech signal. The following observations can be made:

i)  The  2 spectra,  that  is,  the  natural  speech  spec- 
trum and  the  synthetic  speech  spectrum  were  similar 
up  to  2.2  KHz.  For  frequencies  beyond  2.2  KHz, 
they differed markedly. To appreciate this better,
the  2 spectra  have  been  superimposed  in  Figure  5.5. 
ii)  For  frequencies  greater  than  2.2  KHz,  the  synthetic 
speech  spectrum  had  less  energy  content  than  that 
of  the  real  speech.  This  can  be  partially  explained 
by  the  fact  that  the  effect  of  radiation  was  not 
completely  modeled. 


Figure 5.4 Spectrum of a) natural speech, and b) synthetic
speech of subject JH. A Hanning window was used.



Figure 5.7 Frequency response of a) linear prediction
filter of order p = 18, and b) modified vocal
tract filter of order p = 8 of pathological
subject.


Figure 5.8 [pathological subject]


Figure 5.9 Spectrum of a) natural speech, and b) synthetic
speech of pathological subject. A Hanning
window was used.


CHAPTER 


DISCUSSION 

The  steps  to  the  objectives  of  this  research  are 
summarized  in  Figure  1.3,  which  is  shown  again  here 
(Figure 6.1) for convenience.

The  glottal  area  waveform,  measured  from  high-speed 
films  of  the  vocal  cords,  was  used  to  excite  a digital 
filter  model  of  the  vocal  tract.  The  synthetic  speech 
derived  from  the  speech  production  model  was  then  com- 
pared to  the  original  speech  recorded  at  the  filming  ses- 
sion. A formal  listener  session  was  held  to  judge  the 
synthetic  speech  in  terms  of  its  naturalness  and  to  com- 
pare the  synthetic  speech  with  the  real  speech  obtained  at 
the  filming  session. 

The results from the formal listener session showed
that the synthetic speech was not perceived as natural.
When presented in pairs (synthetic speech versus its natural
counterpart), the listeners preferred the natural over the
synthetic speech 9U of the time. When the segments were
presented individually, the listeners judged the synthetic
speech as sounding natural 20% of the time, and the real
speech was judged natural 71.5% of the time.




the conclusions reached on the adequacy of synthetic speech
derived from glottal area excitation and gives recommenda-


Glottal  ft 


The results of the synthetic speech derived from glottal
area waveform excitation lead one to conclude that the area
waveform may not contain all of the parameters necessary for
the production of natural speech. Alternatively, the
difference between the area waveform and the corresponding volume
velocity may be significant in terms of characterizing the
details which contribute to the naturalness of spoken speech.

It can be argued that the glottal area measured from
high-speed films did not reflect the true glottal area
waveform:

i) The anterior-posterior portions of the vocal cord
aperture were often extrapolated by the operator.

ii) Operator approximations in the measurement of the
glottal area were made due to unclear film, mucus
on the vocal cords, and uncertainty as to the
location of the glottal edge.

iii) The camera sampling rate of approximately 4500
frames/s was inadequate. Thus, the high frequency
components of the area waveform were not present.

iv) A short segment was not an adequate representation
of the glottal area waveform. The process of
periodically extending a glottal cycle and incorporating
the jitter and shimmer parameters derived
from the original speech could have deteriorated
the synthetic speech. It would have been more
realistic to measure one second of glottal activity
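
The sampling limitation in item iii) can be made concrete with a short calculation. The Python sketch below uses only the approximate frame rate quoted above (4500 frames/s). The function names and the 125 Hz fundamental frequency are illustrative assumptions, not values taken from the study.

```python
def nyquist_frequency(frame_rate_hz):
    """Highest frequency that can be represented without aliasing."""
    return frame_rate_hz / 2.0


def frames_per_cycle(frame_rate_hz, fundamental_hz):
    """Number of film frames that fall within one glottal cycle."""
    return frame_rate_hz / fundamental_hz


CAMERA_RATE_HZ = 4500.0      # approximate camera rate quoted in the text
ASSUMED_PITCH_HZ = 125.0     # hypothetical fundamental, for illustration only

print(nyquist_frequency(CAMERA_RATE_HZ))                   # 2250.0
print(frames_per_cycle(CAMERA_RATE_HZ, ASSUMED_PITCH_HZ))  # 36.0
```

Under these assumptions, any detail of the area waveform above 2250 Hz cannot be recovered from the film, whatever the care taken in measurement.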


The uniqueness of their model was that it incorporated
source-tract interaction and it was based on 4 physiological
constraints: subglottal pressure, cord tension, rest area
of glottal opening, and vocal tract shape. These parameters

human vocal mechanism. Thus, their main goal was to
incorporate into a model all the parameters necessary for the
production of natural speech, which included source-tract
interaction.

Other related research has been done in vocal cord
modeling and similar observations of the differences between
the area and volume velocity waveforms have been made [63-64].
From the results of vocal cord modeling, the effect of tract
interactions is evident in the volume velocity and not in
the area waveform. Some researchers argue [1,45] that it is
important, if one is interested in the production of natural
speech, that the synthesis system include source-tract


vibratory pattern is insensitive to tract interaction, then
it can be concluded that the area waveform is not adequate
for a complete description of voiced speech.


Synthetic  and  Real  Speech  Remarks 

There were several factors that played a significant
role in the naturalness of the synthetic and real speech.
These factors consisted of the decremental effects on the


paralysed  for 


explained  reason. 

The second subject (PM) had an acute upper respiratory
infection. The film, however, was not clear. This made it
difficult for the operator to measure the glottal area.
Furthermore, the vibratory pattern of the vocal cords was
out of phase, that is, as one cord was making a lateral
excursion, the other was making a medial excursion. Thus,
the operator was never certain when the maximal glottal
opening occurred. For these reasons, this film was not
considered quantitatively in this study.

The treatment of the pathological subject's data
followed closely that of the data derived from the normal
subjects. The synthetic speech was obtained by considering
one pitch period of the glottal area waveform (fifth cycle
in Figure 5.6). The pitch period was periodically extended
and the jitter and shimmer parameters incorporated. The
formant bandwidths were altered in a manner similar to that
used for the normal subjects. For this study, the experimental
data by Fujimura and Lindqvist [3] on formant frequency
and bandwidth were used for the normal subjects.
However, no related information on pathological subjects
is known to exist. This complicated the analysis and introduced
much trial and error in attempting to arrive at satisfactory
formant bandwidths.
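
The periodic extension of a single pitch period with jitter and shimmer, as described above, can be sketched as follows. This is a minimal Python illustration, not the procedure used in the study. It assumes that jitter is supplied as a list of per-cycle lengths in samples, that shimmer is supplied as a list of per-cycle amplitude scale factors, and that a cycle is stretched or compressed by linear interpolation. All names and numeric values are hypothetical.

```python
def resample_cycle(cycle, new_length):
    """Stretch or compress one cycle to new_length samples (linear interpolation)."""
    if new_length < 2 or len(cycle) < 2:
        raise ValueError("cycle and new_length must both contain at least 2 samples")
    scale = (len(cycle) - 1) / (new_length - 1)
    out = []
    for i in range(new_length):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(cycle) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * cycle[lo] + frac * cycle[hi])
    return out


def extend_with_jitter_shimmer(cycle, period_lengths, amplitude_scales):
    """Concatenate copies of one cycle, perturbing length (jitter) and gain (shimmer)."""
    if len(period_lengths) != len(amplitude_scales):
        raise ValueError("one period length and one gain are needed per cycle")
    excitation = []
    for length, gain in zip(period_lengths, amplitude_scales):
        excitation.extend(gain * v for v in resample_cycle(cycle, length))
    return excitation


# Hypothetical five-sample glottal area cycle and three perturbed repetitions.
one_cycle = [0.0, 0.5, 1.0, 0.5, 0.0]
excitation = extend_with_jitter_shimmer(one_cycle, [5, 6, 4], [1.0, 0.9, 1.1])
print(len(excitation))  # 15
```

With real data the cycle would contain many more samples, so the interpolation error introduced by a change of a few samples in length would be small.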

The  synthetic  speech  derived  by  filtering  the  glottal 
area  waveform  with  the  vocal  tract  model  is  shown  in  Figure 
5.8b.  The  original  speech  is  shown  in  Figure  5.8a. 




Figure 6.2 Frequency response of linear prediction model
(p = 18) for two different 27 ms speech segments
of the pathological subject.


work remains

along the lines of Fujimura and Lindqvist [3] for pathological
cases if one is to be successful in synthesizing pathological
speech which sounds undistorted and thus similar to the
original phonation.

Concluding Remarks

This study investigated whether the glottal
area waveform was sufficient for the production of
natural speech. From the results of the listener evaluation
we can conclude that this is not the case.

The question that arises is whether it is worthwhile to
synthesize speech using the glottal area waveform as an excitation
function to a speech production model. Alternatively,
can more insight be gained from the area waveform or from
the derived synthetic speech? It seems that the glottal area
is a better vehicle for studying the vibratory pattern than
its derived synthetic speech. The reason is that the simple

cord vibratory patterns. Additionally, to synthesize speech
the vocal tract formant structure must be derived. In the
case of a pathological subject, this can be an almost
impossible task.

The method of high-speed photography, although more

glottal activity, such as optical glottography, electrical glottography



APPENDIX  A 

SPEECH  SYNTHESIS  BASED  ON  VOCAL  CORD  MODELING 

The purpose of this section is to review vocal cord
modeling as a tool for speech synthesis. For a tutorial
review on vocal cord modeling, the reader is referred to
Baer [65].

Most present-day techniques make the assumption that
the source and vocal tract are linearly separable. This
simplifies analysis and at the same time leads to workable
models requiring few parameters to completely describe
them. In fact, many schemes, such as linear prediction
(previously discussed), incorporate the effects of glottal
excitation into the vocal tract filter. This makes it possible
to produce voiced speech by driving the filter with
a series of impulses separated by the duration of the pitch
period. Although the speech obtained from these methods,
which assume linear separability between source and tract,
is highly intelligible, it sometimes lacks the natural quality
found in real speech. In order to overcome this, the
design should include the details of our vocal mechanism
that are responsible for natural speech production. This
means that all of the parameters which play a significant
role in the production of human speech should be modeled.
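
The linearly separable scheme described above, a filter driven by impulses spaced one pitch period apart, can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the synthesizer used in this work: the recursion adds the weighted past outputs (one common sign convention), the single resonance at 500 Hz with a 60 Hz bandwidth is hypothetical, and so are the 10 kHz sampling rate and the 100-sample pitch period.

```python
import math


def impulse_train(num_samples, period):
    """Unit impulses separated by `period` samples (one per pitch period)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]


def all_pole_filter(excitation, a, gain=1.0):
    """Run y[n] = gain * x[n] + sum_k a[k-1] * y[n-k] over the excitation."""
    y = []
    for n, x in enumerate(excitation):
        acc = gain * x
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                acc += a_k * y[n - k]
        y.append(acc)
    return y


def resonator_coefficients(freq_hz, bandwidth_hz, fs_hz):
    """Two coefficients placing one complex pole pair (a single formant)."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)
    theta = 2.0 * math.pi * freq_hz / fs_hz
    return [2.0 * r * math.cos(theta), -r * r]


FS_HZ = 10000.0                               # assumed sampling rate
coeffs = resonator_coefficients(500.0, 60.0, FS_HZ)
voiced = all_pole_filter(impulse_train(300, 100), coeffs)
print(len(voiced))  # 300
```

Because the excitation carries no information about the glottal pulse shape, any such shape must be absorbed into the filter coefficients, which is the simplification the paragraph above refers to.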




Attempts to model the vocal structure, for the purpose
of synthetic speech production, date as far back as the late
1800's, when Alexander Graham Bell, with the assistance of his
brother Melville, constructed a model of the human head.
Their model included lips, tongue, palate, teeth and windpipe.
The vocal cord orifice was made by stretching a slotted rubber
sheet over tin supports. They were able to produce simple
utterances, and although they probably were not aware,
their model incorporated the effects of source-tract interaction.

The first quantitative self-oscillating one-mass model
of the vocal cords was by Flanagan and Landgraf [14] in 1968.
The model came about as a direct result of previous work,
by Flanagan himself, in attempting to model the vocal cords
[12-13]. It was implemented to produce synthetic voiced
speech [15] and later his vocal cord/vocal tract speech synthesizer
was improved to simulate voiceless excitation [16].
Another investigator, Crystal, also produced a similar model
[66]. However, his model did not account for the closed
phase of the glottal pitch period. He thus limited himself
to analyzing the open period of the glottal cycle. As a consequence,
his model was inadequate for speech synthesis.

Flanagan's one-mass oscillator model of the vocal cords
is shown in Figure A-1. In this figure $P_s$ is the subglottal
pressure, $d$ is vocal cord thickness, $l$ is cord length and $A_g$
is glottal area. Two masses are present, one for each vocal




where Pj is given by (A-3).

Flanagan, for his one-mass model, took F to be equal
to the mean inlet and outlet pressures acting on the
intraglottal surface area. In equation form we have

$$F(t) = \tfrac{1}{2}(P_1 + P_2)(l d) \qquad \text{(A-6)}$$

Initially the cords are at rest ($x = 0$) with the glottal


When motion of the cords is initiated, that is, when F is
applied, a point will be reached where the two masses (cords)
will touch and the glottal area (and hence volume velocity)


$x_c = -A_{g0}/l$, where $x_c$ is referred to as the critical value
of $x$. Two different situations are thus analyzed, the open
period and the closed phase. Taken together they constitute
the total glottal cycle.


s spring 


dilations  justify! 
as  to  produce  an  undamped  natural  frequency  f of  the  vibrating 
system  well  below  the  fundamental  frequency  when  forced.  For 


study Flanagan chose


The closed phase represents a totally different situation.
When the two masses collide, two distinct cases can be
hypothesized.

as can be infinitely
viscous. In the first case, upon collision, the forcing
function takes the value $\tfrac{1}{2} P_s (l d)$. This value is instantaneous
since upon contact the forces acting on the masses will be
in a direction to open the orifice. The closed phase will
therefore be of short duration and for
$x > x_c$. The second case represents the more realistic case
where the vocal cords actually squeeze into one another owing
to their compliant factor. The forcing function during closure
is $\tfrac{1}{2} P_s (l d)$ and will be maintained until $x > x_c$,

the forcing function will again change to some

The closed phase will thus be determined by the total time
during which $x < x_c$. During closure the damping constant of
the system will be different from the value at the open period.

damping constant during the open period and $B'$ is the damping
constant of the opposing mass. Flanagan assumed $B' = 2\sqrt{MK}$,
the condition for critical damping.

Since in real speech K, M and B are continuously changing,
Flanagan defined a cord tension parameter Q by the relationship

$$= \sqrt{QK/(M/Q)} \qquad \text{(A-8)}$$


the stiffening and lightening of the dynamic system. The
effect of Q is to scale the system's parameters as seen
in (A-8).

Now that the vocal cord model has been described we
proceed to discuss its operation as the glottal excitation
of a vocal tract model. Figure A-3 shows the network
representation of the vocal system for voiced sounds as
proposed by Flanagan. The three parameters $R_v = R_v(t)$,
$R_k = R_k(t)$ and $L_g = L_g(t)$ represent respectively the viscous
nonflow-dependent resistance, the kinetic flow-dependent
resistance and the inertance of the glottal orifice. In
accordance with van den Berg's previous work,

$$R_v = 12 \mu d l^2 / A_g^3 \qquad \text{(A-9)}$$

$$R_k = 0.44 \rho U_g / A_g^2 \qquad \text{(A-10)}$$

where $\mu$ is the kinematic viscosity of air and the other terms
are as previously defined. The subglottal pressure is approximated
by the variable battery $P_s$. This is justified by the
fact that the lungs appear as a low impedance constant pressure source.
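
To give a feel for how strongly the two glottal resistances depend on the glottal area, the following Python sketch evaluates them numerically. It assumes that (A-9) and (A-10) have the forms R_v = 12 mu d l^2 / A_g^3 and R_k = 0.44 rho |U_g| / A_g^2. The absolute value, the function names, and all numeric values (CGS units) are illustrative assumptions rather than data from this study.

```python
def viscous_resistance(mu, d, l, area):
    """Viscous (nonflow-dependent) resistance, assumed form 12*mu*d*l^2 / A_g^3."""
    if area <= 0.0:
        raise ValueError("glottal area must be positive (the glottis is closed otherwise)")
    return 12.0 * mu * d * l ** 2 / area ** 3


def kinetic_resistance(rho, flow, area):
    """Kinetic (flow-dependent) resistance, assumed form 0.44*rho*|U_g| / A_g^2."""
    if area <= 0.0:
        raise ValueError("glottal area must be positive (the glottis is closed otherwise)")
    # abs() is an assumption that keeps the resistance nonnegative for reverse flow.
    return 0.44 * rho * abs(flow) / area ** 2


# Hypothetical values, for illustration only.
MU = 1.86e-4      # viscosity of air
RHO = 1.14e-3     # density of air, g/cm^3
D, L = 0.3, 1.8   # cord thickness and length, cm

for area in (0.2, 0.1, 0.05):          # glottal area, cm^2
    r_v = viscous_resistance(MU, D, L, area)
    r_k = kinetic_resistance(RHO, 200.0, area)   # 200 cm^3/s volume velocity
    print(area, r_v, r_k)
```

Halving the area multiplies the viscous term by eight and the kinetic term by four at a fixed flow, so both resistances grow rapidly as the glottis closes.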

The vocal tract was modeled by an articulatory synthesizer.
Each T-section in Figure A-3 represents a portion of
the vocal tract having cross-sectional area A^ = A^(t). The


MS  + si  + kjX  + (V*^)x^  = t(tl 

k,  «e  th«  spring  constants  of  ana  K,, 

, and  f(t)  = ((Pg*Pj)/2  + Pgi.  The  other 


inadequate in its inability to produce other physiological
detail. For example, it could not produce different registers
nor could it account for phase differences within the
vocal cords. The model was also not able to sustain oscillations
under a capacitive input load. As a consequence
phonation was not possible for frequencies near the first
formant since in this region the vocal tract's input impedance
is capacitive [1]. It also suffered from too much
source-tract interaction reflected by the high variations
in fundamental frequency dictated by changes in vocal tract
The development of a two-mass vocal cord model came about
as an effort to overcome these difficulties. An attempt by
Dudgeon [68] to simulate such a model proved its feasibility.
However, the model had certain flaws due to design criteria.


Ishizaka and Matsudaira [69] designed a two-mass
model of the vocal cords, similar in principle to Flanagan's
one-mass model, which was later computer-implemented by Ishizaka
and Flanagan [17]. The model has been used in a speech synthesis
system by Flanagan et al. [45] and more recently has
been implemented so that its control parameters are derived
from printed text [46].

Figure A-5 gives the schematic diagram of the two-mass
approximation of the vocal cords. The masses are only
allowed lateral displacement and their movement can be described
by a pair of second-order nonlinear differential
equations with time-varying coefficients. The spring


[17]


vi) A change in fundamental frequency, for a fixed
vocal tract configuration, had no significant influence on
the shape of the glottal area waveform. However, marked
changes were observed on the corresponding volume velocity
waveform, especially for low first-formant values of the

vii)  Increasing  the  cord  tension  parameter  Q decreased 
the  amplitude  of  glottal  area  and  volume  velocity  while 
not  affecting  the  open  quotient,  phase  difference  (between 
the  two  masses),  and  the  shape  of  the  glottal  area.  The 
volume  velocity  waveform  was  however  affected,  since  an 
increase  in  Q increased  the  fundamental  frequency. 

The overall advantage in vocal cord modeling is the
attainment of source-tract coupling, not possible in linear
systems which assume linear separability between source and
tract. As a consequence a great amount of physiological
detail, corresponding to the human vocal structure, is
achieved. To investigate the effects of source-tract interaction
Guerin et al. [63] recently proposed the model shown
in Figure A-6. The two resonant circuits on the right hand
side simulate the driving point impedance of the vocal tract.
The different parameters are defined as follows: $F_1(t)$ =
first formant frequency, $F_2(t)$ = second formant frequency,
Ljft)  • kj/F^(t),  Rj^(t)  = jCjF^Ct)  and  Lj  (t)  = kj/Fjft). 

Cj,  Cj  and  ®Fe  fixed,  .k^,  and  L are  defined  by  (A-9)- 


vocal  cords- 


APPENDIX B


METHOD  OF  LINEAR  PREDICTION  IN  SPEECH  SYNTHESIS 

The recent works of Makhoul, Wolf, Markel and Gray
[42,44,47,71,2] describe the method of linear prediction.
The following is a general outline of the theory.

In vocal tract modeling, the main objective is to
obtain a transfer function having characteristics similar
to the human vocal tract. Speech, because of its fluctuations
in pitch and vocal tract shape, is a nonstationary
phenomenon, that is, its statistics vary with time. Nevertheless,
if we consider a small segment of speech on the
order of 10-25 ms, an interval over which the vocal tract
does not change appreciably in shape, then we can model the
human vocal tract by a linear time invariant filter. The
diagram given in Figure B-1 illustrates a speech production
model. H(z), the vocal tract transfer function, is given by


Figure  B-1  A speech  production  model. 


H(z)


a complex frequency variable.

$T = 1/f_s$ is the sampling period, $f_s$ is the sampling frequency
and A is the gain. The input g(nT) is the glottal excitation
and this need not be unique. However, the model, as derived
from the linear prediction method, assumes that for voiced
speech the input consists of a train of impulses whose separation
is equal to the pitch period and

of impulses and noise is used whenever the speech is both


Equation (B-1) is the pole-zero vocal tract model.
Given a speech sample s(t) the problem now becomes one of
finding the coefficients $a_k$, $b_k$,

These coefficients completely define the filter H(z) =
S(z)/G(z), where S(z) and G(z) are the z-transforms of s(t)
and g(t) respectively. In general, this is a difficult


linear equations [71] and must resort to using iterative
solutions to arrive at some prescribed minimal error.


the transfer function of the vocal tract consists only of
resonances [42]. The transfer function thus becomes


H(z)


Now this does not in general hold for nasal and fricative
speech since the vocal tract generates antiresonances for

does in fact approximate (B-1) within some error criteria.
Makhoul and Wolf [42] argue that the all-pole model represented
by (B-2) gives adequate results because the human
auditory system is considerably more sensitive to the location
of a pole than to the location of a zero. They also
point out that the sensitivity of the human ear to the envelope
of the speech spectrum is not dependent on the manner
in which the spectrum is generated.

The problem now reduces to one of using linear prediction
to find the filter coefficients $a_k$, $1 \le k \le p$, and then
to find a suitable input, g(t), to implement the filter.

We will assume that at a particular instant in time a speech
sample s(nT), where s(nT) is a sampled version of s(t), can
be approximated by a linearly weighted summation of past p

$$\hat{s}_n = \sum_{k=1}^{p} a_k s_{n-k} \qquad \text{(B-3)}$$

where $s_n = s(nT)$, T is the sampling period, n is the sample number,
$\hat{s}_n$ is the approximation of $s_n$, and $a_1, a_2, \ldots, a_p$ are constants
and will be referred to as the linear prediction coefficients.

Let the error between the actual value and the one
predicted by (B-3) be given by


Now various error minimization criteria can be used in
estimating the $a_k$'s. The method employed in linear prediction
is the least-squares minimization procedure since it
leads to a mathematically attractive solution [42]. Defining
the total-squared error by E we have


Substituting 


(B-5)  gives 


We see that the range of summation in (B-6) has not been
specified. This leads us to consider two distinct cases,
each one having different assumptions.

In the covariance method the range of n is taken over
a finite interval, $0 \le n \le N-1$, where N is some integer.
Equation (B-6) then reduces to


$$\phi_{ik} = \sum_{n=0}^{N-1} s_{n-i} s_{n-k}$$

We arrive at a system of p linear equations with p unknowns.


The autocorrelation method makes the assumption that
$-\infty < n < \infty$, and defines $s_n$ as follows.


known window function,
the conditions in (B-6).




we  have  a system  of  p linear  equations  with  p unknowns. 

The difference between the two methods can be interpreted
as follows: one assumes that the signal is nonstationary
(covariance method) and the other that the signal is stationary
(autocorrelation). Thus, the covariance method reduces to
the autocorrelation method whenever the speech signal is
stationary. It is possible to give other formulations for
these two methods [42], but the ones given here are adequate
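
As an illustration of the autocorrelation method, the Python sketch below solves the p linear equations in p unknowns with the Levinson-Durbin recursion. It is a sketch under stated assumptions and not the program used in this work: the predictor is taken in the form s_n ~ sum of a_k s_(n-k), the data are used without a tapering window, and the test signal (a decaying exponential) and all names are hypothetical.

```python
def autocorrelation(signal, max_lag):
    """R(lag) = sum_n s[n] * s[n + lag], with the signal taken as zero outside its span."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve sum_k a_k R(|i - k|) = R(i), i = 1..order, for the predictor coefficients.

    Returns (coefficients [a_1, ..., a_order], final total-squared prediction error).
    """
    if r[0] <= 0.0:
        raise ValueError("signal has no energy")
    a = [0.0] * (order + 1)          # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        reflection = acc / error
        updated = a[:]
        updated[i] = reflection
        for k in range(1, i):
            updated[k] = a[k] - reflection * a[i - k]
        a = updated
        error *= 1.0 - reflection * reflection
    return a[1:], error


# Hypothetical test signal: impulse response of a one-pole filter with pole at 0.9.
test_signal = [0.9 ** n for n in range(200)]
lpc, residual_energy = levinson_durbin(autocorrelation(test_signal, 2), 2)
print(lpc)
```

For this test signal the first coefficient comes out very close to 0.9 and the second close to zero, which is the expected result for a single real pole.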


The effects of radiation at the mouth and nostrils
and of glottal flow play an important part in the overall
filter design. Flanagan [1] reports that the effect of
radiation at the mouth and nostrils can be approximated by
differentiation. This can be represented by $A_1(1 - z^{-1})$,
where $A_1$ is a constant dependent on the amplitude of the
volume flow at the lips and the distance from the lips to
the microphone [41]. The overall effect due to glottal
shape can be approximated by two poles [41], which leads to
an expression of the form $A_2/[(1 - a_1 z^{-1})(1 - a_2 z^{-1})]$, where $A_2$ is
a constant proportional to the intensity of glottal flow,

the total effect due to radiation and glottal flow is given
by the transfer function


Atal [41] approximates this last expression to an all-pole


W(z)


and indicates that the approximation is

usually $a_1 \approx 1$.


these effects into consideration [41-42]. The value of p
should therefore be chosen appropriately so that it takes
into account formant frequencies, glottal flow, and radiation.
A good value of p is usually around 13 to 16.


APPENDIX  C 


PROGRAMS 


Example


The example presented in this section is aimed at
illustrating the use of the programs previously listed.
The programs are all written in FORTRAN (except for some
of the subroutines, which are in assembly language) and
were implemented on a Nova 2 minicomputer. The programs


In the example below, the program LPCHOOTl calculates
the filter coefficients and stores them in a disk file.
The jitter and shimmer parameters are then derived from
the original speech signal by the program JITSHIM. The
program then stores these two parameters in separate disk
files. The third program, called LPCJITSHIM, reads in
from disk files the filter coefficients, the jitter and
shimmer parameters, and the glottal input pitch period.
The program then periodically extends the glottal cycle
and incorporates the jitter and shimmer parameters. The
resultant glottal waveform is filtered through the vocal
tract model and the output waveform stored in a file on
disk. The final process of incorporating the effect of
radiation is done by the program DIFFERENCE. The output
speech is stored on a specified disk file.
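
The last stage of the processing chain described above, incorporating the effect of radiation, can be sketched in Python as a first difference, that is, the discrete counterpart of the differentiation discussed in Appendix B. This is an assumption about what the program DIFFERENCE computes, not a transcription of its listing. The function name, the gain argument, and the sample values are hypothetical.

```python
def first_difference(signal, gain=1.0):
    """Apply gain * (1 - z^-1): y[n] = gain * (x[n] - x[n-1]), taking x[-1] = 0."""
    previous = 0.0
    out = []
    for x in signal:
        out.append(gain * (x - previous))
        previous = x
    return out


# Hypothetical vocal tract filter output, for illustration only.
filtered = [1.0, 3.0, 6.0, 6.0]
radiated = first_difference(filtered)
print(radiated)  # [1.0, 2.0, 3.0, 0.0]
```

The operation emphasizes the high frequencies of the filtered waveform and removes any constant offset, which is why it is applied after the vocal tract filter rather than before it.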

intervention is required to specify input and output data
files, and information regarding formant frequencies and
bandwidths. The first program requires the use of a display
oscilloscope interfaced to the computer for purposes of
displaying the frequency responses
of the linear prediction filter and the modified filter.

The programs are called from disk file by typing
on the terminal the name of the program. Execution is
terminated when an R is displayed on the terminal.


DIFFERENCE


APPENDIX  D 

LISTENER  EVALUATION  FORM 


CLASSIFICATION

[Form layout not recoverable: numbered items, each with the
response choices "same" and "different".]

APPENDIX E


DATA FOR SUBJECTS MB AND JB

The data for subjects MB and JB are presented in
this appendix. The tables below correspond to Table 5.1
of subject JH. The figures presented here for the two
subjects follow those of subject JH, that is, Figures E-1

Chapter V.

The analysis of the data for these two subjects is
similar to that of subject JH. An apparent difference is
that for these two subjects the spectrum of the synthetic
speech, for frequencies below half the camera's sampling
frequency, does not coincide with the spectrum of the real
speech as well as that for subject JH. This can probably
be attributed to the high fundamental frequency of subjects

area made by the operator for these two subjects.


Table E-1

[Plot: normalized glottal area]

[Plot (a): frequency axis, 0.00 to 5.00 kHz]


Figure E-2 Frequency response of a) linear prediction filter
of order p = 16, and b) modified vocal tract filter
of order p = 8. An 18 ms speech record of subject
MB was used for the analysis.


Figure E-3 a) Natural speech of subject MB.
b) Synthetic speech of subject MB.


Figure E-4 Spectrum of a) natural speech, and b) synthetic
speech of subject MB. A Hanning window was used.


[Plot: normalized glottal area]


Figure E-6 c) Pitch period of interpolated glottal area
waveform of subject JB. d) FFT of above.
A rectangular window was used.


(a) 


Figure E-7 Frequency response of a) linear prediction filter
of order p = 16, and b) modified vocal tract
filter of order p = 8. An 18 ms speech record
of subject JB was used for the analysis.




BIOGRAPHICAL  SKETCH 


in New York City on October 20,
lived there until completing his secondary
education at Bishop Dubois High School.

In 1966 he moved to Puerto Rico to further his
education at the University of Puerto Rico, Mayaguez
Campus. During the years 1966 to 1971 he finished his
Bachelor of Science degree, with high honors, in both
Electrical Engineering and Mathematics, and from 1971 to
1973 held a graduate teaching assistantship at the University
of Puerto Rico's Department of Mathematics.

He accepted, in 1973, a teaching position as an
instructor in the Department of Electrical Engineering
at the same university. He taught from 1973 to 1974 and
simultaneously worked on his Master's Thesis in the field
of Mathematics. Upon completion of his Master's Degree
in July, 1974, he was granted a leave of absence from the
University of Puerto Rico to pursue doctoral studies in
Electrical Engineering, with a concentration in Bioengineering,
at the University of Florida.

He is a member of IEEE, Sigma Xi, CIAA, and is a
licensed engineer in Puerto Rico.




This dissertation was submitted to the Graduate Faculty of
the College of Engineering and to the Graduate Council, and
was accepted as partial fulfillment of the requirements for
the degree of Doctor of Philosophy.

December, 1976

Dean, College of Engineering


Dean, Graduate School


